CTNet: a convolutional transformer network for EEG-based motor imagery classification

Brain-computer interface (BCI) technology enables direct communication between the brain and machines, unlocking new possibilities for human interaction and rehabilitation. EEG-based motor imagery (MI) plays a pivotal role in BCI, enabling the translation of thought into actionable commands for interactive and assistive technologies. However, the limited decoding performance of brain signals restricts the broader application and development of BCI systems. In this study, we introduce a convolutional Transformer network (CTNet) designed for EEG-based MI classification. First, CTNet employs a convolutional module analogous to EEGNet, dedicated to extracting local and spatial features from EEG time series. Next, it incorporates a Transformer encoder module, leveraging a multi-head attention mechanism to capture the global dependencies of the EEG's high-level features. Finally, a straightforward classifier module comprising fully connected layers categorizes the EEG signals. In subject-specific evaluations, CTNet achieved decoding accuracies of 82.52% and 88.49% on the BCI IV-2a and IV-2b datasets, respectively. In the more challenging cross-subject assessments, CTNet achieved recognition accuracies of 58.64% on the BCI IV-2a dataset and 76.27% on the BCI IV-2b dataset. In both subject-specific and cross-subject evaluations, CTNet holds a leading position when compared to some of the state-of-the-art methods. This underscores the efficacy of our approach and its potential to set a new benchmark in EEG decoding.


Related work
Motor imagery classification with CNN
CNNs have shown effectiveness in automatically extracting spatial features and classifying EEG signals, and they have gradually led to superior performance in MI-EEG data analysis tasks. Schirrmeister et al. introduced the ShallowConvNet and DeepConvNet architectures, inspired by the FBCSP transformation [27]. ShallowConvNet comprises two convolutional layers, an average pooling layer, and a fully connected layer with a softmax classifier. The first convolution operates across time, while the second convolution operates across space, serving as an analogue to the CSP spatial filter. Compared to ShallowConvNet, DeepConvNet enhances its feature network by incorporating three additional convolutional-pooling blocks. Each of these blocks consists of a convolutional layer followed by a pooling layer, with all pooling operations in DeepConvNet using max pooling for feature extraction. Lawhern et al. developed the notable EEGNet architecture, an advancement over the ConvNet structure [28]. EEGNet is composed of three types of convolution operations: temporal convolution, channel depth-wise convolution, and separable convolution. EEGNet stands out as a versatile architecture suitable for various BCI paradigms, achieving impressive results even when trained on limited data sets. Chowdhury et al. conducted an in-depth study on EEGNet and introduced a five-branch CNN architecture (EEGNet fusion V2) aimed at enhancing cross-subject motor imagery classification [31]. Each branch of this network incorporates EEGNet with varied hyperparameters to optimize feature fusion and classification performance. Their approach demonstrated promising results across multiple datasets, including eegmmidb, BCI IV-2a, and BCI IV-2b, significantly outperforming established models such as EEGNet, ShallowConvNet, and DeepConvNet in cross-subject scenarios. Ingolfsson et al. introduced EEG-TCNet, a network that integrates EEGNet and TCN in a sequential manner [36]. This network is characterized by its low memory usage and reduced computational complexity, making it particularly well-suited for embedded classification in resource-constrained environments, such as edge devices. When evaluated on the BCI IV-2a dataset, EEG-TCNet demonstrated a classification accuracy of 77.35%.
Beyond EEGNet, researchers have introduced several other impactful methods for CNN-based classification of MI-EEG. Jia et al. introduced a CNN-based framework that integrates spatial and temporal information processing to enhance the classification of motor imagery EEG signals [37]. The proposed model, featuring time-contained spatial filtering (TSF) and a spatial-temporal analysis network (STAN), achieved average accuracies of 83.0% on the BCI IV-2a dataset and 88.0% on the BCI IV-2b dataset. The TSF-STAN model significantly surpasses existing approaches by effectively managing both spatial and temporal features. Amin et al. introduced MCNN, a fusion of multiple CNN models designed to harness diverse convolutional features for capturing both spatial and temporal elements from raw EEG data [29]. The MCNN achieved a classification accuracy of 75.72% on the BCI IV-2a dataset and 95.4% on the High Gamma Dataset. Zhao et al. introduced a multi-branch 3D convolutional neural network (M3DCNN) tailored for MI classification, which transforms EEG signals into sequences of 2D arrays that retain the spatial distribution of sampling electrodes [30]. This design allows the M3DCNN model to capture both the spatial and temporal dynamics of EEG data. In subject-specific analysis, the M3DCNN model achieved an accuracy of 71.02% and a kappa coefficient of 0.644 on the BCI IV-2a dataset. In the more challenging cross-subject evaluation, the model attained an accuracy of 52.17% and a kappa coefficient of 0.453. Sakhavi et al. developed a classification framework for MI data that involves a novel temporal representation, created by modifying the FBCSP method, and a tailored CNN for classification [32]. This approach outperformed the existing method in the literature on the BCI competition IV-2a dataset with a 7% increase in average subject accuracy.

Attention-based transformer architecture
The attention mechanism is an effort to emulate the human brain's behavior of selectively focusing on a few significant elements while ignoring others. The self-attention mechanism has the intrinsic ability to evaluate global dependencies on very long sequences. A few studies have attempted to adopt Transformer models for MI-EEG classification. Tao et al. [46] implemented a modified version of the Transformer, known as the gated Transformer, for EEG signal analysis. This model employs a gating mechanism instead of traditional residual connections to learn feature representations along a sequence of embeddings. The gated Transformer demonstrated improved results on the PhysioNet dataset. Song et al. introduced an EEG decoding method focusing on self-attention mechanisms [47]. It starts with preprocessing and spatial filtering of EEG data, followed by applying attention transformations on the feature-channel dimension to emphasize key spatial features. The critical step involves segmenting the data temporally for attention processing, yielding distinct representations. Xie et al. crafted five Transformer-based models for MI-EEG classification, leveraging the Transformer model's capabilities and the spatial-temporal attributes of EEG signals [48]. These models achieved top classification accuracies of 83.31%, 74.44%, and 64.22% in two-, three-, and four-class motor-imagery tasks, respectively, during cross-individual validation on the PhysioNet dataset. They also highlighted that integrating positional embedding modules into the Transformer could further enhance EEG classification performance. However, these models overlook the importance of learning local features, which are crucial for EEG signal decoding. In the EEG Conformer, a self-attention module is directly integrated to extract global correlations from the local temporal features. The EEG Conformer achieved accuracies of 78.66% on the BCI IV-2a dataset, 84.63% on the BCI IV-2b dataset, and 95.30% on the SEED dataset. Altaheri et al. developed an attention-based TCN (ATCNet) to enhance the accuracy of EEG-based motor imagery classification [54]. ATCNet utilizes a convolutional module, similar to EEGNet, for extracting local features from EEG trials, multi-head self-attention for emphasizing the most significant features in MI-EEG data, a temporal convolutional network for high-level temporal feature extraction, and a convolutional-based sliding window for efficient MI-EEG data augmentation. Over 10 runs executed on all subjects in the BCI IV-2a dataset, ATCNet achieved an average accuracy of 81.98%.
For cross-subject motor imagery BCIs, Keutayeva and Abibullaev explore the efficacy of attention mechanism-based models [55]. The study compares three models (ViT, Hybrid Spatial CNN + ViT, and Hybrid Temporal CNN + ViT) using the Leave-One-Subject-Out (LOSO) cross-validation method. Results highlight the effectiveness of the Hybrid Temporal CNN + ViT (HTCV) model on the BCI IV-2a dataset, though it underperforms on the BCI IV-2b dataset. This emphasizes the data size limitations affecting the performance variability of transformer models. Building upon [55], Keutayeva and Abibullaev delve deeper into attention mechanisms, proposing the Spatio-Temporal CNN + ViT (st-CViT) model [56]. This study offers a comprehensive comparison of attention-based models, utilizing data augmentation techniques and employing nested LOSO for robust model selection. Their findings demonstrate that combining attention mechanisms with deep learning models significantly enhances the robustness and accuracy of subject-independent BCIs, particularly when leveraging enriched data environments.
Our research builds on these works. Inspired by them, we propose CTNet as an efficient backbone for MI-EEG decoding.

Datasets
Two publicly available benchmark datasets, BCI IV-2a [57] and IV-2b [58], are used to evaluate the validity of our proposed end-to-end model. These two datasets are provided by the Graz University of Technology. The details of these two datasets, preprocessing, and data augmentation are described below.
(1) BCI IV-2a dataset: The IV-2a dataset contains four MI tasks (left hand, right hand, both feet, and tongue) and is available for download at https://www.bbci.de/competition/download/competition_iv/BCICIV_2a_gdf.zip. The dataset comprises recordings from nine subjects (A01-A09). Each participant contributed EEG data obtained through 22 Ag/AgCl electrodes with inter-electrode distances of 3.5 cm. The signals were sampled at 250 Hz and bandpass filtered from 0.5 to 100 Hz, with an additional 50 Hz notch filter activated to suppress line noise. Subjects participated in two sessions on separate days, with the first session allocated for training and the second for testing. Each session contained 288 trials, with 72 trials per task. We used the temporal segment of [2, 6] seconds in our experiments, so each trial is a matrix of dimension (22, 1000).

(2) BCI IV-2b dataset: The IV-2b dataset contains the recordings of nine subjects (B01-B09) performing left-hand and right-hand MI activities and is available for download at https://www.bbci.de/competition/download/competition_iv/BCICIV_2b_gdf.zip. Three bipolar recordings (C3, Cz, and C4) were acquired with a sampling frequency of 250 Hz. The recorded EEG signals were band-pass filtered from 0.5 to 100 Hz, with a notch filter at 50 Hz set at the time of recording using the signal acquisition hardware. Each subject participated in five sessions, where the first three sessions were used for calibrating an EEG decoder and the remaining two sessions were used for testing. There are about 400 trials and 320 trials in the training and test sets, respectively. We used the [...]

(3) Preprocessing: Each raw trial $X_i$ is normalized as

$$X_i' = \frac{X_i - \mu}{\delta}$$

where $X_i' \in \mathbb{R}^{C \times T}$ denotes the output of normalization, and $\mu$ and $\delta$ represent the mean value and the standard deviation of the raw trial $X_i$, respectively. The model's purpose is to associate each MI trial input, denoted as $X_i'$, with its corresponding category $y_i$. This process involves utilizing a set of m labeled MI trials $S = \{(X_i', y_i)\}_{i=1}^{m}$, where $y_i \in \{1, ..., N\}$ and N is the total number of defined categories for set S. For the BCI IV-2a dataset, C = 22 EEG channels, T = 1000 time samples, N = 4 MI categories, and m = 5184 MI trials. For the BCI IV-2b dataset, C = 3 EEG channels, T = 1000 time samples, N = 2 MI categories, and m = 6520 MI trials.

(4) Data augmentation: The available EEG datasets for MI are limited, which can lead to overfitting when training deep learning models. To overcome this problem, an augmentation scheme is needed that makes the best use of the small amount of available data when training a deep learning model. Existing data augmentation strategies include adding Gaussian white noise, cropping data, and segmentation and recombination (S&R). In this study, we employ S&R in the time domain to augment the training set. Following [59], the augmentation process involves equally dividing each EEG training trial into several non-overlapping segments, followed by the generation of new artificial trials by concatenating segments taken from randomly selected training trials in their original temporal order. Here, [A, B, C] represents the sequential concatenation of samples from segments A, B, and C, respectively, and $R_k$, which selects the trial supplying the k-th segment, is a randomly selected integer from the range [1, M]. This entire methodology is depicted and explained in detail in Fig. 1.
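The normalization and S&R steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the number of segments, and the number of generated trials are hypothetical, and the sketch assumes that segments are recombined only among trials of the same class and that T is divisible by the number of segments.

```python
import numpy as np

def normalize_trial(x):
    """Standardize one trial of shape (C, T) with its own mean and standard deviation."""
    return (x - x.mean()) / x.std()

def segment_and_recombine(trials, labels, n_segments, n_new_per_class, rng):
    """Generate artificial trials by segmentation and recombination in time.

    trials: array (n_trials, C, T); labels: array (n_trials,).
    Assumption: segments are drawn only from trials of the same class,
    and T is divisible by n_segments.
    """
    T = trials.shape[-1]
    seg_len = T // n_segments
    new_trials, new_labels = [], []
    for cls in np.unique(labels):
        pool = trials[labels == cls]           # the M trials of this class
        M = pool.shape[0]
        for _ in range(n_new_per_class):
            # R_k: source trial for the k-th segment (0-based here, [1, M] in the text)
            picks = rng.integers(0, M, size=n_segments)
            segments = [pool[r, :, k * seg_len:(k + 1) * seg_len]
                        for k, r in enumerate(picks)]
            # Concatenate along time so each segment keeps its original position
            new_trials.append(np.concatenate(segments, axis=-1))
            new_labels.append(cls)
    return np.stack(new_trials), np.array(new_labels)
```

Because every segment stays at its original time position, an artificial trial has the same shape (C, T) as a real one and can be mixed directly into the training batches.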

Overall framework of proposed model
In this study, we have developed a convolutional transformer model for MI-EEG decoding, incorporating the MHA mechanism [45]. The overall framework of our model is depicted in Fig. 2. The framework facilitates end-to-end classification of EEG trials, eliminating the need for exploring handcrafted feature extraction techniques. The proposed model consists of three main components: a convolutional module, a Transformer encoder module, and a fully connected classifier. The convolutional module encodes low-level spatial-temporal information within the MI-EEG trial through three convolutional layers: a temporal convolution, a channel depth-wise convolution, and a spatial convolution. The convolutional module processes the standardized EEG trial as input. It accounts for both the temporal continuity of the sampled EEG signals and the functional inter-connectivity among various electrode channels. The convolutional module outputs a higher-level temporal sequence representation. Following this, a Transformer encoder employs MHA to emphasize the most critical features within this sequence. Finally, the process concludes with a compact classifier, composed of two fully connected layers, which generates the decoding results.

Convolutional module
The convolutional module in this study draws inspiration from EEGNet. We have designed the convolutional module by decomposing the two-dimensional convolution operator into two one-dimensional components: temporal and spatial convolution. Additionally, we introduce depth-wise convolution. In contrast to EEGNet, we opt for one-dimensional convolution instead of separable convolution, resulting in improved performance.
The convolutional module consists of three convolutional layers, as shown in Fig. 2. The initial layer conducts a temporal convolution utilizing $F_1$ filters with a size of (1, $K_{C1}$), where $K_{C1}$ represents the filter length along the time axis. This operation outputs $F_1$ feature maps containing the EEG signal at different band-pass frequencies. The value of $K_{C1}$ was set to approximately one-fourth of the sampling rate (64 samples). This choice enables the filters to capture temporal information associated with frequencies of 4 Hz and above. Subsequently, we employ a depth-wise convolution with $F_2$ filters of size (C, 1) for spatial filtering, where C represents the number of EEG electrode channels (22 for the BCI IV-2a dataset and 3 for the BCI IV-2b dataset). This approach facilitates the learning of spatial filters corresponding to each temporal feature map, thereby efficiently extracting frequency-specific spatial filters. The depth parameter D, empirically set to 2, determines the number of spatial filters learned for each feature map. Hence, the output of the channel depth-wise convolution comprises $F_1 \times D$ feature maps. Subsequent to the depth-wise convolution, an average pooling layer with a size of (1, $P_1$) is employed to extract temporal information, down-sampling the signal by a factor of 8. This results in a reduced sampling rate of approximately 32 Hz. The third convolutional layer, a spatial convolution, comprises $F_2$ filters of size (1, $K_{C2}$). We set $K_{C2}$ to 16 to decode motor imagery (MI) activities within a 500 ms window for data sampled at approximately 32 Hz. This is followed by a second average pooling layer of size (1, $P_2$) for dimensionality reduction. The parameter $P_2$ regulates the length of the EEG feature sequence (token size), which is then fed into the Transformer encoder.

Figure 1. The principle of data augmentation.
All convolutional layers are followed by batch normalization (BN), which enhances the training process and alleviates overfitting. Following the second and third BN layers, exponential linear units (ELUs) are employed as the activation function to introduce non-linearity. The two average pooling layers are each followed by a dropout operation. For subject-specific classification, we set the dropout probability to 0.5 to help prevent overfitting when training with small sample sizes. For cross-subject classification, where the training sets are considerably larger, we reduced the dropout probability to 0.25. In the end, the convolutional module generates feature maps $S \in \mathbb{R}^{T_C \times d}$, where $T_C$ represents the length of the EEG trial's high-level feature representation, calculated as follows:

$$T_C = \left\lfloor \frac{T}{P_1 \times P_2} \right\rfloor$$

where T is the number of time samples of the raw EEG trial. Here, d denotes the number of feature channels, equivalent to $F_2$, which is set to 16. Subsequently, all feature channels at each time point are fed as a token into the Transformer encoder.
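The arithmetic behind the token sequence can be checked with a short sketch. The helper below is hypothetical and assumes that the convolutions preserve the temporal length ("same" padding) and that each pooling layer uses a stride equal to its size. The value $P_2$ = 8 in the usage line is an assumption for illustration, since this section does not state it, while $F_1$ = 8 follows from d = $F_1 \times D$ = 16 with D = 2.

```python
def conv_module_output(T, fs, F1, D, P1, P2):
    """Return (T_C, d, rate_after_first_pool) for the convolutional module.

    Assumes length-preserving convolutions and average pooling whose
    stride equals the pooling size.
    """
    d = F1 * D                # feature channels after the depth-wise convolution (= F2)
    rate = fs / P1            # effective sampling rate after the first pooling layer
    T_C = (T // P1) // P2     # number of tokens passed to the Transformer encoder
    return T_C, d, rate

# One 4 s trial sampled at 250 Hz; P2 = 8 is an illustrative assumption
tokens, channels, rate = conv_module_output(T=1000, fs=250, F1=8, D=2, P1=8, P2=8)
```

With these values the first pooling layer lowers the rate to 31.25 Hz (the "approximately 32 Hz" in the text), so a 16-sample kernel spans roughly 500 ms.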

Transformer encoder
The Transformer network was initially introduced for machine translation, employing an encoder-decoder architecture with stacked MHA and position-wise fully connected feed forward blocks. Layer normalization (LN) and residual connections are incorporated to enhance the training efficiency and robustness of the model. In our study, focused on classification, we exclusively utilize the encoder block within the Transformer, comprising L layers in depth. Each layer contains two sub-layers: an MHA mechanism and a feed forward network. The overall architecture of the Transformer encoder is depicted in Fig. 2.
The MHA mechanism is employed to capture global temporal dependencies in the high-level representation of EEG, thereby complementing the limited receptive field of the convolutional module. The MHA comprises multiple self-attention layers known as heads, as illustrated in Fig. 3. The model transforms the input by multiplying S with three distinct weight matrices, thereby deriving three principal components: the queries Q, keys K, and values V:

$$Q_i = S W_i^Q, \quad K_i = S W_i^K, \quad V_i = S W_i^V$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the projection matrices of the query, key, and value of the i-th head, respectively. The "Scaled Dot-Product Attention" mechanism calculates the dot products of the query with all keys, scales these products by dividing them by $\sqrt{d_k}$, where $d_k$ is the key dimension, and then applies a softmax function to determine the weights assigned to the values. The attention score $Z_i$ of each attention head is calculated as follows:

$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

The MHA enables the model to concurrently process and integrate information from various representation sub-spaces at different positions. By executing several self-attention (SA) operations in parallel, with each operation targeting distinct facets of the input, the MHA provides a comprehensive analysis of the input data. The outputs from these individual SA units are then concatenated and combined through a linear transformation. This integration enables the model to capture a broad spectrum of data dependencies, enhancing its representational capability:

$$\mathrm{MHA}(S) = \mathrm{Concat}(Z_1, \ldots, Z_h)\, W^O$$

where h denotes the number of heads and $W^O$ is the weight matrix of the linear transformation. As a result of this mechanism within the MHA block, every node in the network acquires a global receptive field, enabling the model to capture and integrate information from across the entire input sequence. After processing through the MHA block, the output features are combined with the original input features S through a residual connection. Subsequently, the model employs LN to standardize each feature. The output of the MHA mechanism can be expressed as:

$$S' = \mathrm{LN}(S + \mathrm{MHA}(S))$$

The following position-wise fully connected feed forward network is also applied in residual form. It is applied to each position separately and identically. This sub-layer consists of two linear transformations with a Gaussian Error Linear Unit (GELU) activation and a dropout operation in between. The GELU activation function is given by:

$$\mathrm{GELU}(x) = x\,\Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian (normal) distribution, which can be expressed as:

$$\Phi(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where erf(x) denotes the error function, a special function defined as an integral of the Gaussian distribution. Subsequently, LN is performed. Finally, the sum of the input feature and the output feature is used as the output in a residual operation, where the output feature is produced by the position-wise feed forward operation (PF).
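The attention and GELU computations described above can be sketched with NumPy and the standard library. This is an illustrative sketch rather than the authors' code: the function names, the weight shapes, and the choice of four heads in the usage lines are assumptions, and dropout, layer normalization, and the residual connections are omitted for brevity.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(S, W_q, W_k, W_v, W_o):
    """S: (T_C, d). W_q, W_k, W_v: (h, d, d_k). W_o: (h * d_k, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = S @ Wq, S @ Wk, S @ Wv               # per-head projections
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / math.sqrt(d_k))    # (T_C, T_C) attention weights
        heads.append(weights @ V)                      # Z_i for this head
    # Concatenate the heads and apply the output linear transformation
    return np.concatenate(heads, axis=-1) @ W_o

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical sizes: 15 tokens, d = 16 feature channels, h = 4 heads of size 4
rng = np.random.default_rng(0)
S = rng.standard_normal((15, 16))
W_q, W_k, W_v = (rng.standard_normal((4, 16, 4)) for _ in range(3))
W_o = rng.standard_normal((16, 16))
out = multi_head_attention(S, W_q, W_k, W_v, W_o)
```

Each row of the attention weights sums to one, so every output token is a weighted average over all tokens in the sequence, which is what gives each position a global receptive field.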

Classifier module
In the classification block, the output features of the convolutional module and the Transformer encoder are added, enabling direct transmission of features extracted by the CNN to the classifier, and then flattened. Following this, a dropout operation with a probability of 0.5 is applied to mitigate overfitting and enhance generalization. Finally, these processed features are fed into a fully connected layer comprising N units, where N represents the number of categories in the MI-EEG classification task. Cross-entropy is used as the loss function for the entire model, as follows:

$$\mathcal{L} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{N} y_{ij} \log(\hat{y}_{ij})$$

where M is the number of EEG trials, $y_{ij}$ is the true label for the j-th class in the i-th sample, and $\hat{y}_{ij}$ is the predicted probability for the j-th class in the i-th sample.
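As a small worked example of the loss, the sketch below computes the mean cross-entropy over a batch of one-hot labels in plain Python. The function name and the clipping constant `eps` are illustrative choices, not part of the described model.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over M trials.

    y_true: list of one-hot label vectors; y_pred: list of predicted
    probability vectors. eps guards against log(0).
    """
    M = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(labels, probs))
    return total / M

# A four-class prediction that is uniform gives a loss of log(4)
chance_loss = cross_entropy([[1, 0, 0, 0]], [[0.25, 0.25, 0.25, 0.25]])
```

A loss close to log(N) therefore signals a model that is still guessing at chance level for an N-class task.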

Performance metric
To ensure a thorough evaluation, four widely recognized metrics are employed: accuracy, Cohen's Kappa, Hedges' g, and the Wilcoxon p-value. Accuracy is calculated as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP and TN represent the numbers of positive and negative samples correctly predicted by the model, respectively, and FP and FN denote the numbers of false positive and false negative samples predicted by the model, respectively. Cohen's Kappa is calculated as follows:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ denotes the accuracy of the model, and $P_e$ represents the probability or accuracy of a random guess.
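The two metrics can be illustrated with a few lines of Python. The function names are hypothetical, and the usage line assumes balanced classes so that the chance level is $P_e$ = 1/N; under that assumption, the average four-class accuracy of 82.52% reported for BCI IV-2a corresponds to a Kappa of about 0.767.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def cohen_kappa(p_o, p_e):
    """Cohen's Kappa from observed accuracy p_o and chance accuracy p_e."""
    return (p_o - p_e) / (1.0 - p_e)

# Assumption: four balanced classes, so a random guess is right 25% of the time
kappa_2a = cohen_kappa(0.8252, 1 / 4)
```

Kappa rescales accuracy so that 0 means chance-level performance and 1 means perfect agreement, which makes four-class and two-class results easier to compare.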
To assess the effect size of the Transformer module and the data augmentation operation on model performance, we use Hedges' g. This metric is particularly suitable for small sample sizes and provides a more accurate estimate of effect size by correcting Cohen's d for sample size. Hedges' g is calculated as follows [60]:

$$g = d \times \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)$$

where d (Cohen's d) is calculated as:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}}$$

where $\bar{X}_1$ and $\bar{X}_2$ are the means of the two groups being compared (e.g., with and without the Transformer module), $s_1$ and $s_2$ are the standard deviations of the two groups, and $n_1$ and $n_2$ are the sample sizes of the two groups, respectively. In this study, $n_1$ and $n_2$ are both equal to 9. A value of g around 0.2 suggests a small but potentially meaningful impact. A value around 0.5 denotes a moderate impact that is likely to be of practical significance. A value around 0.8 or higher indicates a large impact, which is very likely to be of substantial practical significance. These benchmarks help in interpreting the magnitude and importance of the effects in the context of experimental findings.
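A minimal sketch of the effect-size computation is given below, using only the Python standard library. It assumes the common small-sample correction factor 1 - 3 / (4(n1 + n2) - 9) and a pooled standard deviation; the function name and the toy input values are illustrative and are not results from this study.

```python
import math
from statistics import mean, stdev

def hedges_g(group1, group2):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)       # sample standard deviations
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    d = (mean(group1) - mean(group2)) / pooled  # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)    # shrinks d for small samples
    return d * correction

# Toy example: means 4 and 3, pooled SD 2, so d = 0.5 and g = 0.5 * 0.8 = 0.4
g_toy = hedges_g([2, 4, 6], [1, 3, 5])
```

For two groups of nine subjects, as in this study, the correction factor is 1 - 3/63, or about 0.952, so g is slightly smaller than d.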
Additionally, we utilize the p-value derived from the paired Wilcoxon Signed-Rank Test to assess the statistical significance of differences between the proposed model and other state-of-the-art approaches. In this context, a p-value > 0.05 indicates the absence of a statistically significant difference. Conversely, a p-value < 0.05 (denoted as '*') signifies a statistically significant difference, and a p-value < 0.01 (denoted as '**') indicates a highly significant difference.

Training procedure
Our models were trained on an Nvidia RTX 3090 GPU with 24 GB of memory. We utilized PyTorch, an open-source deep learning framework, on a workstation equipped with the Debian 11 operating system and an Intel Core i9-9820X CPU. Our analysis focused exclusively on EEG-channel data, and we chose to directly discard the three electrooculography (EOG) channels without performing any artifact removal procedures. The proposed model is evaluated using subject-specific and cross-subject evaluations. We reserved 30% of the initial training set as a validation set. The model that achieved the minimum loss on the validation set was selected.
For subject-specific evaluations, we adhered to the data division scheme outlined in the competition guidelines. The following training configuration was used: the Adam optimizer was employed with a learning rate of 0.001, and the $\beta_1$ and $\beta_2$ parameters were set to 0.5 and 0.999, respectively. The loss function used was categorical cross-entropy. We set the batch size to 288 and the number of training epochs to 1000. The dropout rate $p_1$ is set to 0.5. Unless specified otherwise, the hyperparameters utilized across all experiments for the two datasets are detailed in Table 1. These hyperparameters were selected following a series of preliminary experiments, aimed at achieving the best possible generalization of our model. We kept these parameters the same for all subjects to ensure uniformity and comparability of results.
For cross-subject evaluation, we employ the LOSO method. In this approach, we sequentially select one subject from the total of nine as the test subject, while aggregating the EEG data from the remaining subjects to form the training dataset. This process is repeated for each subject, thereby ensuring that the model is trained and evaluated on diverse subsets of data. The learning rate, batch size, and number of training epochs were set to 0.001, 512, and 600, respectively. The dropout rate $p_1$ is set to 0.25.
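The LOSO protocol can be sketched as a simple generator over subject identifiers. The function name is hypothetical, and loading the trials that belong to each subject is left out.

```python
def loso_splits(subject_ids):
    """Yield (training_subjects, test_subject) pairs for leave-one-subject-out."""
    for test_subject in subject_ids:
        training = [s for s in subject_ids if s != test_subject]
        yield training, test_subject

# Nine folds for the BCI IV-2a subjects A01 to A09
subjects = [f"A0{i}" for i in range(1, 10)]
folds = list(loso_splits(subjects))
```

Each fold trains on the pooled data of eight subjects and tests on the one held out, so no trial from the test subject is seen during training.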

Comparison with state-of-the-art approaches
In our study, we performed comprehensive subject-specific and cross-subject experiments and compared our method against several state-of-the-art approaches on the BCI IV-2a and IV-2b datasets. To ensure a relatively fair comparison, we reimplemented four prominent models (ShallowConvNet, DeepConvNet, EEGNet, and Conformer) based on their open-source code, maintaining uniform experimental conditions, including data preprocessing, training and validation set splits, and data augmentation strategies. For these models, hyperparameters were configured to match those given in their respective papers, except for the learning rate, batch size, and training epochs. These parameters were standardized across all models, including CTNet, to facilitate a fairer comparison.

Subject-specific classification
The compared methods include ShallowConvNet, DeepConvNet, EEGNet, and TSF-STAN, which are recognized for their strong CNN-based end-to-end performance on EEG datasets. We also assessed the performance of the Conformer, which integrates local features captured by CNNs with global features processed by the Transformer encoder. Additionally, we conducted comparisons against MI-CAT, a domain adaptation network that leverages the Transformer's self-attention and cross-attention mechanisms. The results for TSF-STAN and MI-CAT are those reported in their papers.
As shown in Table 2, our proposed model, CTNet, demonstrates robust performance. In the subject-specific experiments on the BCI IV-2a dataset, CTNet achieved an average accuracy of 82.52%, which is only 0.48% less than the top-performing TSF-STAN model. CTNet recorded the highest classification accuracies for subjects A01, A03, and A04. Additionally, it exhibited the most consistent performance across subjects, as evidenced by the lowest standard deviation of 9.61%, and achieved the highest Kappa score of 0.7670. CTNet significantly surpassed ShallowConvNet by 6.83% in accuracy (p < 0.01). Similarly, CTNet displayed comparable improvements over EEGNet. Moreover, CTNet's average classification accuracy exceeded that of DeepConvNet by 4.74%, Conformer by 4.86%, and MI-CAT by 5.71% (p < 0.05). These results collectively demonstrate the effectiveness of CTNet, which captures local features through the CNN and global features via the Transformer.
Table 3 provides a comparison of subject-specific classification performance against several leading algorithms using the BCI IV-2b dataset. CTNet stands out in this comparison, achieving an average accuracy of 88.49% and a Kappa score of 0.7697. It also shows the most consistent performance across different subjects, as indicated by the lowest standard deviation of 9.03%. Compared to the ConvNet architectures, CTNet displayed significant improvements, outperforming ShallowConvNet by 3.36% (p < 0.01) and DeepConvNet by 3.28% (p < 0.01). CTNet's average classification accuracy surpasses that of EEGNet and TSF-STAN by 0.78% and 0.49%, respectively. Compared to the hybrid CNN and Transformer architectures of Conformer and MI-CAT, CTNet achieves higher accuracies by 2.62% (p < 0.05) and 3.21%, respectively.

The average confusion matrices for CTNet are depicted in Fig. 4. Figure 4a presents the confusion matrix for the BCI IV-2a dataset, indicating that left-hand imagery is the easiest to recognize with an accuracy of 86.27%, while tongue imagery proves the most challenging with an accuracy of 79.63%. The most frequent misclassification occurred when imagined foot movements were incorrectly classified as imagined tongue movements, with a misclassification rate of 10.03%. Furthermore, Fig. 4b displays the average confusion matrix for the BCI IV-2b dataset, highlighting that the accuracies for decoding left-hand and right-hand imagery are 90.81% and 86.16%, respectively. Imagined left-hand movements were misclassified as right-hand movements in 9.19% of cases, while imagined right-hand movements were misclassified as left-hand movements in 13.84% of cases.

Cross-subject classification
To assess the model's ability to generalize to unseen subjects, we undertook cross-subject evaluations using the LOSO methodology, with findings detailed in Table 4 for the BCI IV-2a dataset and Table 5 for the BCI IV-2b dataset. As indicated by Table 4, on the BCI IV-2a dataset, CTNet achieved an average classification accuracy of 58.64%, which is second only to DeepConvNet by a margin of 1.51%. CTNet also ranks second in terms of standard deviation and Kappa value, at 14.61% and 0.4486, respectively. Furthermore, CTNet's average accuracy surpasses that of ShallowConvNet, EEGNet, and Conformer by 1.89%, 1.79%, and 5.23% (p < 0.05), respectively. As shown in Table 5, on the BCI IV-2b dataset, CTNet achieved the highest average classification accuracy of 76.27%. It also recorded the smallest standard deviation at 5.26% and the highest Kappa value of 0.5252. CTNet performed exceptionally well for subjects B01, B02, B03, and B09. Compared to purely CNN-based models like ShallowConvNet, DeepConvNet, and EEGNet, CTNet's average accuracy was higher by 1.99%, 1.09%, and 1.14%, respectively, and 2.75% higher than the CNN and Transformer hybrid architecture, Conformer. The findings from Tables 4 and 5 collectively underscore CTNet's superior generalization capacity.

Ablation study
A key advancement of CTNet over CNN-based methods is the incorporation of a Transformer encoder module, which employs MHA to learn global representations of high-level features in EEG trials. Additionally, data augmentation may also contribute to the final decoding results. Consequently, we performed ablation studies on subject-specific classification experiments within the BCI IV-2a and IV-2b datasets. Our ablation experiments involved individually removing S&R data augmentation, individually removing the Transformer, and removing both data augmentation and the Transformer concurrently. The results of these experiments are illustrated in Fig. 5. The effect sizes for the ablation experiments are presented in Table 6.
For the BCI IV-2a dataset, ablation experiments are depicted in the upper subplot of Fig. 5. It is evident from Fig. 5 that for the majority of subjects, removing the Transformer module, data augmentation, or both significantly diminishes the recognition accuracy. Specifically, excluding the Transformer alone resulted in a noticeable decrease in recognition accuracy for most subjects (A01-A05 and A09), with a particularly pronounced drop of 9.37% for subject A04; however, subjects A07 and A08 did not experience any change in accuracy, and subject A06 actually showed a slight increase of 0.69%. Overall, the removal of the Transformer module reduced the average model accuracy by 1.77% (p < 0.05). The corresponding effect size, as measured by Hedges' g, is 0.179. This is a small effect size, suggesting that the inclusion of the Transformer module has a modest positive impact on model performance.
When data augmentation alone was removed, all subjects except A06 exhibited a marked decrease in accuracy, averaging a significant decline of 7.21% (p < 0.01, g = 0.758), which corresponds to a medium to large effect size. Given that Transformer models typically require substantial amounts of training data, this result underscores the critical role of data augmentation in maximizing the performance benefits of the Transformer module. Simultaneously removing both the Transformer and data augmentation resulted in an average accuracy reduction of 6.55% across all participants (p < 0.01, g = 0.595), highlighting the significant contributions of both components to recognition accuracy.
Additionally, we observed that in the absence of data augmentation, due to the limited training data, incorporating the Transformer module could actually decrease the recognition accuracy for four subjects (especially a 7.29% drop for A07), with an average decline of 0.66%.
For the BCI IV-2b dataset, as shown in the lower subplot of Fig. 5, the impact of removing the Transformer module mirrors that observed in the BCI IV-2a dataset, with a notable decrease in recognition accuracy across most participants, averaging a reduction of 1.79% (p < 0.05, g = 0.184). Removal of either data augmentation or both modules did not reduce accuracy as drastically as in the BCI IV-2a dataset, with average decreases of 1.33% (g = 0.143) and 1.32% (p < 0.05, g = 0.139), respectively. Furthermore, without data augmentation, the inclusion of the Transformer does not enhance the model's average recognition accuracy.
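For readers unfamiliar with the effect size reported above, the following sketch computes Hedges' g for two groups of per-subject accuracies. The text does not state which variant of the statistic was used, so this assumes the common form for two independent samples (pooled standard deviation with the approximate small-sample correction); the function name and the input values are purely illustrative and are not results from this study.

```python
import math
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Hedges' g for two independent samples (assumed variant).

    g = J * (mean_a - mean_b) / pooled_sd, where
    J = 1 - 3 / (4 * (n_a + n_b) - 9) is the small-sample correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    cohen_d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n_a + n_b) - 9.0)
    return correction * cohen_d


# Made-up toy values, chosen only so the arithmetic is easy to follow:
# means 4 and 3, pooled SD 2, d = 0.5, J = 0.8.
with_module = [2.0, 4.0, 6.0]
without_module = [1.0, 3.0, 5.0]
print(round(hedges_g(with_module, without_module), 3))  # 0.4
```

A positive value means the first group scores higher on average. By the usual rule of thumb, values near 0.2 are read as small effects and values near 0.8 as large, which is consistent with how the effect sizes are described in the text.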

Effect of hyper-parameters
In this section, we meticulously assess the influence of various critical hyper-parameters on the performance of the model in subject-specific classification. These parameters encompass the length of the high-level EEG features (token size) inputted into the Transformer module, the number of self-attention heads, and the depth of the Transformer encoder architecture.
Token size T_c, as determined by the second average pooling kernel, plays a critical role in the effectiveness of the Transformer encoder. Excessively large pooling kernel sizes may overly smooth temporal features, obliterating valuable details, whereas too small sizes might render the model susceptible to local noise disturbances. This necessitates a delicate balance in selecting the convolution module's second pooling size. To address this, we evaluated the effects of varying pooling sizes on model performance within the BCI IV-2a and IV-2b datasets to identify an optimal size that enables the model to discern global features without succumbing to local noise interference, as illustrated in Fig. 6. The upper and lower subplots in Fig. 6 respectively illustrate the impact of different token sizes on recognition accuracy within the BCI IV-2a and BCI IV-2b datasets. Token size T_c ranged from 12 to 125. It is evident from Fig. 6 that both excessively large and small token sizes yield suboptimal performance. On the BCI IV-2a dataset, the optimal average accuracy is achieved at a token size of 20 (P_2 = 6), reaching 83.14%. This is significantly better than the models with token sizes of 12 (P_2 = 10) and 125 (P_2 = 1), which are 3.94% (p < 0.01) and 1.35% (p < 0.01) less effective, respectively. Similarly, on the BCI IV-2b dataset, the best average accuracy is observed at a token size of 15 (P_2 = 8), amounting to 88.49%. This surpasses the performance at token sizes of 12 and 125 by 2.01% (p < 0.05) and 1.73%, respectively. Hence, opting for a relatively larger pooling parameter P_2, which results in comparatively smaller tokens, allows for more efficient utilization of EEG's local features, thereby enhancing the Transformer's capability to aggregate global features.
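The relationship between the second pooling size and the token size quoted above can be checked with a short calculation. This sketch assumes non-overlapping average pooling (stride equal to the kernel size) with any incomplete trailing window discarded; the input length of 125 is taken from the reported case in which a pooling size of 1 gives a token size of 125. The function name is hypothetical.

```python
def token_size(length_before_pool: int, pool_size: int) -> int:
    """Number of tokens left after non-overlapping pooling.

    Assumes stride == kernel size and that an incomplete trailing
    window is dropped, which is the usual default behaviour.
    """
    return length_before_pool // pool_size


LENGTH_BEFORE_SECOND_POOL = 125  # token size reported for a pooling size of 1

for p2 in (1, 6, 8, 10):
    print(f"pooling size {p2:2d} -> token size "
          f"{token_size(LENGTH_BEFORE_SECOND_POOL, p2)}")
```

Under these assumptions the four pooling sizes reproduce the reported token sizes of 125, 20, 15, and 12, which makes explicit that a larger pooling size yields a shorter token sequence for the Transformer encoder.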
The number of heads is a critical parameter in the Transformer encoder, which leverages the MHA mechanism. MHA enables the model to process information concurrently, with each head concentrating on a distinct facet of the input sequence. Our research investigates the impact of varying the number of heads, as illustrated in Fig. 7, where we explore a range of head numbers h from 1 to 16. The upper and lower subplots in Fig. 7 depict the effects of varying numbers of attention heads on the model's performance across the BCI IV-2a and BCI IV-2b datasets, respectively. It is evident from the figure that the accuracy for the same subject fluctuates across different numbers of attention heads in the CTNet, indicating the model's sensitivity to the number of heads. CTNet demonstrates greater sensitivity to the number of heads in the BCI IV-2a dataset compared to BCI IV-2b. Overall, across both the BCI IV-2a and IV-2b datasets, the CTNet models equipped with two attention heads yield the best performance. On the BCI IV-2a dataset, the model with two heads surpasses those with 1, 4, 8, and 16 heads in terms of average recognition accuracy by 1.08%, 1.00%, 2.23%, and 0.58%, respectively. Similarly, on the BCI IV-2b dataset, the model with two heads outperforms those with 1, 4, 8, and 16 heads in terms of average recognition accuracy by 0.79%, 0.90%, 0.35%, and 1.12%, respectively. The Kappa index exhibits a similar pattern of results.
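It is unsurprising that the Kappa index follows the same pattern as accuracy: when classes are balanced, Cohen's kappa is a linear rescaling of accuracy that subtracts the chance level. The sketch below illustrates this with the standard definition of Cohen's kappa; the function names are hypothetical, and the confusion matrix uses the average BCI IV-2b percentages quoted in the text for Fig. 4b, treating both classes as equally frequent.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = true class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


def kappa_from_accuracy(accuracy, n_classes):
    """Kappa under the assumption of perfectly balanced classes."""
    chance = 1.0 / n_classes
    return (accuracy - chance) / (1.0 - chance)


# Average BCI IV-2b confusion matrix (in %), as quoted for Fig. 4b.
fig_4b = [[90.81, 9.19],
          [13.84, 86.16]]
print(round(cohen_kappa(fig_4b), 4))             # 0.7697
print(round(kappa_from_accuracy(0.5864, 4), 4))  # 0.4485
```

The first value agrees with the Kappa of 0.7697 reported for the BCI IV-2b dataset, and the second is within rounding of the 0.4486 reported for the cross-subject BCI IV-2a evaluation.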
Depth L is a critical factor that influences the fitting capabilities of the Transformer model. We explored the effects of varying depth levels on the CTNet by incrementally increasing the number of Transformer encoder blocks from 1 to 10, as depicted in Fig. 8. The upper and lower subplots in this figure respectively illustrate the impact of Transformer depths on the CTNet model performance across the BCI IV-2a and BCI IV-2b datasets. It is evident that the distributions of recognition accuracy for different subjects' MI intentions vary with changes in the Transformer depth. Specifically, for the BCI IV-2a dataset, the model featuring a Transformer with a depth of 6 layers achieves the highest average recognition accuracy, reaching 82.52%, which is 3.09% higher than that of the model with a 10-layer Transformer. Similarly, in the BCI IV-2b dataset, a Transformer with 6 layers yields the highest average recognition accuracy at 88.49%, surpassing that of the 2-layer Transformer by 1.58% (p < 0.05).

Visualization of feature distributions
To demonstrate the capabilities of the proposed CTNet, we visualized the features it extracts. Utilizing t-distributed stochastic neighbor embedding (t-SNE), a renowned technique for dimensionality reduction and visualization, we aimed to evaluate the discriminative capacity of the features extracted by our network. Figure 9 presents the visualization of feature distributions for subject A03 on subject-specific classification, comparing scenarios with and without the integration of the Transformer encoder module in both training and test datasets. Notably, the features obtained without the Transformer module, as depicted in Fig. 9a,b, show a relatively small inter-class distance among the four categories and a larger intra-class variance. Conversely, with the inclusion of the Transformer module, as illustrated in Fig. 9c,d, there is a marked increase in the inter-class separation and a substantial reduction in the intra-class distance. This outcome further corroborates that the fusion of CNN and the Transformer module substantially amplifies the discriminative capability of the features.
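The visual comparison above relies on two notions, inter-class separation and intra-class spread, that can also be expressed numerically. The sketch below is not part of the authors' analysis, which is based on t-SNE plots; it only illustrates one simple way to quantify the two quantities for a set of labelled feature vectors, using class centroids and Euclidean distances. All names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def separation_stats(features_by_class):
    """Return (mean inter-class distance, mean intra-class distance).

    Inter-class: average distance between pairs of class centroids.
    Intra-class: average distance of each point to its own centroid.
    """
    centroids = {c: centroid(pts) for c, pts in features_by_class.items()}
    intra = [math.dist(p, centroids[c])
             for c, pts in features_by_class.items() for p in pts]
    inter = [math.dist(centroids[a], centroids[b])
             for a, b in combinations(centroids, 2)]
    return sum(inter) / len(inter), sum(intra) / len(intra)


# Toy 2-D embedding with two classes (illustrative values only).
toy = {
    "left": [(0.0, 0.0), (0.0, 2.0)],
    "right": [(4.0, 0.0), (4.0, 2.0)],
}
print(separation_stats(toy))  # (4.0, 1.0)
```

A larger first value together with a smaller second value corresponds to the pattern described for Fig. 9c,d. Distances measured in a t-SNE embedding should be interpreted with care, since t-SNE does not preserve global distances.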

Discussion
CNN-based methodologies have demonstrated efficacy in MI-EEG classification, chiefly due to CNN's robust capability in local feature extraction. Nonetheless, CNNs typically possess a limited receptive field, potentially impeding their ability to capture global feature dependencies. The Transformer model's self-attention mechanism effectively captures long-distance dependencies within data, facilitating a comprehensive understanding of the entire input sequence. This feature is particularly critical in MI-EEG signal processing, where MI tasks involve complex cortical coordination that often spans extensive intervals in the time series. The Transformer's self-attention mechanism is essential for understanding and analyzing complex activity patterns across multiple time points. Additionally, the Transformer can dynamically adjust its focus, applying weighted attention to critical signal features within MI-EEG data, such as specific frequency band rhythm changes, thereby significantly enhancing the model's sensitivity to key information and accuracy in decoding. Based on these insights, we introduce the CTNet model, which combines the CNN's capability for local feature extraction with the Transformer's ability to process global information, offering substantial advantages in the decoding of MI-EEG signals. The efficacy of the CTNet has been validated through subject-specific and cross-subject classification experiments conducted on the BCI IV-2a and BCI IV-2b datasets.
Ablation studies by Jia et al. also revealed that without the TSF preprocessing step, using only the STAN network would decrease the average recognition accuracy and Kappa value on the BCI IV-2a dataset by 17.3% (p < 0.01) and 0.2260, underscoring the significant performance boost provided by the TSF preprocessing step. When compared to CTNet using data augmentation, TSF-STAN showed a 0.48% higher recognition accuracy on the BCI IV-2a dataset, while CTNet had a 0.49% higher accuracy on the BCI IV-2b dataset. When CTNet without data augmentation is compared with the STAN model alone, CTNet's average recognition accuracy and Kappa value were higher by 9.61% and 0.1318, respectively.
Conformer and MI-CAT are exemplary models for decoding MI-EEG, utilizing a hybrid architecture that combines CNN and Transformer technologies. In 2023, combining the local feature extraction capabilities of ShallowConvNet with the global modeling strength of the Transformer, Song et al. proposed the Conformer model. On the BCI IV-2a and IV-2b datasets, the average recognition accuracies of the Conformer model were 1.97% and 0.74% higher than those of ShallowConvNet. This also demonstrates that incorporating a Transformer to globally model the high-level features extracted by CNNs can enhance the model's ability to recognize MI-EEG signals. Correspondingly, the number of trainable parameters also increased by approximately 0.12 million. Our CTNet model, inspired by both Conformer and EEGNet, is designed to achieve high recognition accuracy while maintaining a smaller number of trainable parameters, thus reducing overfitting and enhancing the model's generalization capability. In 2023, Zhang and colleagues proposed the MI-CAT model to address the inter-subject variability of EEG signals. MI-CAT employs a temporal-spatial CNN to learn feature representations from paired EEG data, followed by two domain-related attention blocks that preserve domain-dependent information. It then utilizes the Transformer's self-attention and cross-attention mechanisms to facilitate feature interaction and resolve differential distributions across different domains. Additionally, MI-CAT uses bandpass filtering (BF) and exponential moving standardization (EMS) for data preprocessing. Without data augmentation, MI-CAT achieved remarkable average recognition accuracies of 76.81% and 85.28% on the BCI IV-2a and IV-2b datasets, respectively, with Kappa values of 0.692 and 0.706. In comparison with the CTNet model without data augmentation, MI-CAT exhibited a 1.50% higher average recognition accuracy on the BCI IV-2a dataset, while CTNet performed 1.88% better on the BCI IV-2b dataset. This shows that the recognition accuracies of the CTNet and MI-CAT models are comparable. However, MI-CAT has over 55,000 more trainable parameters than CTNet.
In summary, compared to state-of-the-art methods such as ShallowConvNet, DeepConvNet, EEGNet, TSF-STAN, Conformer, and MI-CAT, the CTNet model is relatively small yet achieves decoding accuracy comparable to that of the TSF-STAN method on both the BCI IV-2a and IV-2b datasets. Specifically, CTNet's accuracy is higher than that of the other state-of-the-art methods by 4.74% to 6.83% on the BCI IV-2a dataset and by 0.78% to 3.36% on the BCI IV-2b dataset. Notably, while TSF-STAN utilizes a complex TSF data preprocessing method, CTNet employs a simple standardization process, greatly simplifying the preprocessing pipeline. TSF-STAN's complex TSF preprocessing requires substantial computation, whereas CTNet's straightforward standardization process reduces computational complexity and resource demands. Achieving high accuracy with this simpler pipeline highlights the practical advantages of our approach. In practical applications, reducing computational
The effect size analysis provides further insights into the contributions of the Transformer module and data augmentation. For the BCI IV-2a and IV-2b datasets, when data augmentation was applied, adding the Transformer module resulted in effect sizes (Hedges' g) of 0.179 and 0.184, respectively. This suggests a small but positive impact of the Transformer module when data augmentation is utilized, highlighting its ability to enhance model performance by capturing global dependencies in the data. Conversely, when the model did not use data augmentation, adding the Transformer module resulted in effect sizes of -0.063 and -0.002 for the BCI IV-2a and BCI IV-2b datasets, respectively. These negative or near-zero effect sizes indicate that the Transformer module alone does not improve, and may even slightly detract from, model performance without data augmentation. These findings also indicate that combining a CNN with a Transformer without data augmentation can lead to a decrease in recognition accuracy. This decline is likely due to the introduction of the Transformer module, which increases the model's trainable parameters more than fourfold, thereby exacerbating issues of overfitting. Transformers are equipped with a large number of parameters and layers that are advantageous for capturing complex patterns in extensive datasets but can lead to overfitting when the training data is scarce. Under these conditions, the model may start memorizing noise and specific details of the training set instead of generalizing from it. Lacking sufficient data, the Transformer's advanced mechanisms, such as multi-head attention, are not fully leveraged. This scenario results in a model that is overly complex for the available data volume, consequently underperforming. These findings align with the results from Keutayeva's research 55,56 .
When the Transformer module was removed, using data augmentation resulted in effect sizes of 0.441 and -0.049 for the BCI IV-2a and BCI IV-2b datasets, respectively. This shows that data augmentation alone can have a substantial positive effect in the BCI IV-2a dataset but may have a slightly negative impact in the BCI IV-2b dataset without the Transformer. When the Transformer module was used, adding data augmentation resulted in effect sizes of 0.758 and 0.143, respectively. These results underscore the importance of the Transformer module in improving the model's performance. Its ability to capture global dependencies in the data, especially when combined with data augmentation, significantly boosts the model's effectiveness, particularly in datasets where data augmentation alone may not be sufficient.
For the BCI IV-2a and IV-2b datasets, when the model simultaneously uses both the Transformer and data augmentation, the effect sizes are 0.595 and 0.139, respectively. These results indicate that the combined use of the Transformer and data augmentation significantly enhances model performance on the BCI IV-2a dataset, while the effect is more modest on the BCI IV-2b dataset.
Overall, these results underscore the importance of the Transformer module in enhancing model performance. While data augmentation provides substantial benefits, the Transformer's advanced mechanisms, such as self-attention, are essential for fully leveraging the enriched data environment and capturing complex temporal dependencies in EEG signals. The Transformer's effectiveness is particularly pronounced when combined with data augmentation, as it significantly boosts the model's ability to generalize from enriched data.

Discussion on the impact of hyperparameters on model performance
We investigated three critical hyperparameters of the CTNet model: token size, the number of heads in the MHA, and the depth of the Transformer module. The CTNet model is sensitive to the settings of these parameters. Our findings suggest that smaller tokens effectively reduce local noise, which facilitates the learning of global features. When decoding EEG signals, capturing the spatial distribution of brainwaves is crucial. Each "head" in a Transformer can be viewed as an independent feature detector, focusing on different dimensions of information. CTNet performs best with a two-head attention mechanism. Two heads represent an optimal balance, sufficient to capture the essential spectral characteristics of the µ rhythm (8-13 Hz) and β rhythm (13-30 Hz), while a higher number of heads could exceed the processing needs required for the complexity of MI-EEG signals, potentially reducing overall efficiency and effectiveness. Furthermore, CTNet achieves optimal recognition performance with a Transformer encoder of depth 6, mirroring findings from Conformer studies. Feature visualization further affirmed that the Transformer encoder facilitates learning more discriminative features than those extracted without the Transformer.

Limitations and future work
Although CTNet has demonstrated superior performance in both subject-specific and cross-subject MI-EEG decoding across two datasets, outperforming several advanced methods in terms of recognition accuracy, it still faces certain limitations. Firstly, there is significant room for improvement in CTNet's recognition accuracy, especially in cross-subject MI-EEG decoding tasks. Secondly, CTNet appears sensitive to specific hyperparameters such as token size, the number of heads in the MHA, and the depth of the Transformer module. This sensitivity might necessitate extensive hyperparameter tuning to achieve optimal performance, which can be time-consuming and computationally demanding. Additionally, the S&R data augmentation strategy does not significantly contribute to the recognition accuracy of subject-specific MI-EEG decoding on the BCI IV-2b dataset. Moving forward, we plan to explore regularization strategies specifically aimed at addressing cross-subject variability, which may enhance the model's recognition performance in cross-subject MI-EEG decoding. To address the issue of hyperparameter sensitivity, we will attempt to automate the search for optimal hyperparameter combinations using reinforcement learning-based methods. Furthermore, we also intend to explore the use of generative adversarial networks (GANs) to enhance the training dataset.
Zhang et al. ingeniously integrated the Transformer with domain adaptation (DA) to address the variability in EEG signal distribution across different subjects. They introduced a cross-attention Transformer domain adaptive network, named MI-CAT. It achieves an average classification accuracy of 76.81% on the BCI IV-2a dataset and 85.26% on the BCI IV-2b dataset 24 . Drawing inspiration from ShallowConvNet and previously mentioned Transformer models, Song et al. introduced a compact convolutional Transformer named EEG Conformer, designed to capture both local and global features within a unified EEG classification framework 53 . The model utilizes ShallowConvNet for learning local features through one-dimensional temporal and spatial convolution layers.
accomplished by concatenating these segments in their original temporal order, each sourced from different but randomly chosen training trials belonging to the same category, thereby preserving the inherent time order. More formally, let {X_i′}, i ∈ [1, M], denote the set of M EEG trials that are available for training for a given category. Each training trial X_i′ is split into K continuous segments. The artificial trial is then generated by taking, for each segment position k ∈ [1, K], the k-th segment of a randomly chosen trial from this set and concatenating the K selected segments in temporal order.
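The augmentation procedure described above can be sketched in a few lines. This is an illustrative reading of the description rather than the authors' implementation: trials are represented as plain nested lists (channels by time samples), donor trials are drawn independently for each segment position so the same trial may be picked more than once, and any samples left over when the trial length is not divisible by K are assigned to the last segment. The function name and these choices are assumptions.

```python
import random


def make_artificial_trial(trials, n_segments, rng):
    """Build one artificial trial from same-category training trials.

    trials: list of trials, each a list of channels, each a list of samples.
    For every segment position k, a donor trial is picked at random and
    its k-th time segment is copied, so temporal order is preserved.
    """
    n_channels = len(trials[0])
    n_samples = len(trials[0][0])
    seg_len = n_samples // n_segments
    artificial = [[] for _ in range(n_channels)]
    for k in range(n_segments):
        donor = rng.choice(trials)  # assumption: sampled with replacement
        start = k * seg_len
        # Assumption: the last segment absorbs any leftover samples.
        stop = n_samples if k == n_segments - 1 else start + seg_len
        for ch in range(n_channels):
            artificial[ch].extend(donor[ch][start:stop])
    return artificial


# Toy data: 5 trials, 2 channels, 8 samples; the value encodes
# (trial index * 100 + time index) so the origin of each sample is visible.
toy_trials = [[[100 * i + t for t in range(8)] for _ in range(2)]
              for i in range(5)]
new_trial = make_artificial_trial(toy_trials, n_segments=4,
                                  rng=random.Random(0))
print(new_trial[0])
```

In the printed channel, the last two digits of each value still run from 0 to 7, showing that every sample keeps its original position in time, while the leading digit reveals which training trial donated each segment.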

Table 1. Global hyper-parameters used for all subjects.

Table 2. Subject-specific classification accuracy (in percentage %) and Kappa of state-of-the-art algorithms on the BCI IV-2a dataset. + Reimplemented. The bold font highlights the best result among the different methods.

Table 3. Subject-specific classification accuracy (in percentage %) and Kappa of state-of-the-art algorithms on the BCI IV-2b dataset. + Reimplemented. The bold font highlights the best result among the different methods.

Table 4. Cross-subject classification accuracy (in percentage %) and Kappa of state-of-the-art algorithms on the BCI IV-2a dataset. + Reimplemented. The bold font highlights the best result among the different methods.

Discussion on subject-specific classification
In our study, we compare CTNet with leading algorithms based solely on CNN architectures (ShallowConvNet, DeepConvNet, EEGNet, TSF-STAN) and those combining CNN and Transformer frameworks (Conformer and MI-CAT) for subject-specific MI-EEG decoding. We reimplemented the models from the open-source code of ShallowConvNet, DeepConvNet, EEGNet, and Conformer to ensure a fair comparison under identical experimental conditions. To conduct a thorough comparison of these state-of-the-art algorithms, we delve deeper into four aspects: data preprocessing, data augmentation strategies, model architecture, and the quantity of trainable network parameters, with comparative results presented in Table 7.
In 2017, inspired by the FBCSP algorithm, Schirrmeister et al. introduced the notable ShallowConvNet and DeepConvNet models. ShallowConvNet, employing just two one-dimensional convolutions (temporal and spatial), achieved notable results, with average accuracies and Kappa values of 75.69% and 0.6759 on the BCI IV-2a dataset, and 85.13% and 0.7026 on the BCI IV-2b dataset, respectively. DeepConvNet, building upon ShallowConvNet by adding three convolution-pooling blocks, improved performance on both datasets, particularly achieving a 2.09% and 0.08% increase in average accuracy and Kappa values on the BCI IV-2a dataset. The cost, however, was an increase of over 200,000 in the number of trainable parameters.
In 2018, Lawhern et al. proposed the advanced EEGNet model. This model introduced depth-wise and separable convolutions, significantly reducing the quantity of trainable parameters to 2.9k for the BCI IV-2a dataset and 2.1k for the BCI IV-2b dataset, effectively mitigating overfitting. Among all compared models, EEGNet had the fewest parameters, which greatly aids in reducing training times and deploying the model on memory-constrained devices. EEGNet's average recognition accuracies on the BCI IV-2a and BCI IV-2b datasets were 1.70% and 2.58% higher than those of ShallowConvNet, with Kappa values also improving by 0.0227 and 0.0516, respectively.
In 2022, Jia et al. introduced the TSF-STAN model. This model initially leverages time-contained spatial filtering for data preprocessing to increase the inter-category difference of EEG signals while preserving temporal features; it then utilizes a CNN-based spatial-temporal analysis network to further exploit discriminative spatial and temporal features and classify different EEG categories in an end-to-end process. Even without data augmentation, TSF-STAN's average recognition accuracies on the BCI IV-2a and BCI IV-2b datasets were 5.61% and 0.29% higher than those of EEGNet employing data augmentation. The performance might further improve if TSF-STAN utilized data augmentation. Ablation studies by Jia et al.

Table 7. Comparative analysis of state-of-the-art algorithms for subject-specific classification. + Reimplemented. The bold font highlights the best result among the different methods.
Furthermore, the global processing is particularly beneficial in datasets enriched through data augmentation, as it helps the model to generalize better across varied yet synthetically expanded data.